Pop quiz! (No marks, just check your understanding. It’s…tricky.)
This dataset records abalone from the coast of Tasmania, Australia (Nash, 1995) and was accessed from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/1/abalone).
Introduction
Abalone are marine snails that are considered a delicacy and are very expensive. The older the abalone, the higher the price. Age is determined by counting the number of rings in the shell. To do this, the shell needs to be cut, stained and viewed under a microscope, which is a lot of effort. Researchers measured 9 attributes of the abalone: sex, length, diameter, height, whole, shucked, viscera, shell, and rings.
Note: whole, shucked, viscera and shell are weight measurements.
What is the response variable?
length
rings
shell (weight)
whole (weight)
Reading comprehension :)
Research question
What is the best research question, based on the context above?
Is there a correlation between abalone age and weight?
Can abalone weight be predicted from other measured variables?
Is there a relationship between abalone size and age?
Can age be measured by size?
A is incomplete: weight is only one of many measured predictors. B is incorrect: we care about age (rings), not weight. D is incorrect: we are trying to model or predict abalone age from size, not measure it. Terminology matters.
Explore data
We sample the data to make it easier to visualise relationships. We also remove the sex variable because it is not numeric.
    library(dplyr)  # provides %>%, select() and sample_n()

    abalone <- read.csv("data/abalone.csv")
    set.seed(1113)  # reproducible randomness
    abalone <- abalone %>%
      select(-sex) %>%  # remove `sex` because it is categorical
      sample_n(100)     # sample 100 observations for a cleaner curve
    str(abalone)
In the ring plot, the relationship is not clearly linear and there is fanning, so the equal-variance assumption is unlikely to be met. There is also very high correlation between some predictors (e.g. length and diameter), so the answer is D.
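The two checks mentioned here (curvature or fanning against the response, and correlation between predictors) can be sketched in a few lines. The abalone file is not bundled with this page, so this example assumes R's built-in `mtcars` data as a stand-in, with `mpg` playing the role of `rings`. The column choices are illustrative only.

```r
# Stand-in data: built-in mtcars (an assumption; swap in the sampled abalone data)
dat <- mtcars[, c("mpg", "disp", "wt", "hp")]

# Pairwise correlations between predictors; values near +/-1 suggest collinearity
cor_mat <- cor(dat[, c("disp", "wt", "hp")])
print(round(cor_mat, 2))

# Scatterplot matrix: look for curved trends and fanning against the response
pairs(dat)
```

With the abalone sample, the same calls on `abalone` would be expected to show the strong length and diameter correlation described above.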
Fit a model
We apply a natural log transformation to the response variable with log() to account for the non-linear relationship.
    fit <- lm(log(rings) ~ ., data = abalone)
    summary(fit)
Whole (weight) has a period (.) beside its p-value. This means the p-value is between 0.05 and 0.10, but it needs to be below 0.05 to be considered significant at the 5% level.
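Rather than reading the significance codes by eye, the p-values can be pulled out of the summary and compared with the thresholds directly. This is a minimal sketch, assuming the built-in `mtcars` data as a stand-in for the abalone sample. The model formula and the `flags` table are illustrative names, not part of the original analysis.

```r
# Stand-in model (an assumption): log-transformed response, several predictors
fit <- lm(log(mpg) ~ wt + hp + qsec + drat, data = mtcars)

# Coefficient table from summary(); drop the intercept row
coef_table <- coef(summary(fit))
p_values <- coef_table[-1, "Pr(>|t|)"]

# A "." in the printed output corresponds to 0.05 <= p < 0.10
flags <- data.frame(
  p_value    = signif(p_values, 3),
  below_0.10 = p_values < 0.10,
  below_0.05 = p_values < 0.05
)
print(flags)
```

A predictor marked with a period would show `TRUE` in `below_0.10` and `FALSE` in `below_0.05`.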
Fit a model
    Residual standard error: 0.1996 on 92 degrees of freedom
    Multiple R-squared:  0.6187,  Adjusted R-squared:  0.5897
    F-statistic: 21.32 on 7 and 92 DF,  p-value: < 2.2e-16
We determine model fit with:
Multiple R-squared and p-value
Adjusted R-squared and p-value
Adjusted R-squared and residual standard error
Multiple R-squared and residual standard error
We have multiple predictors, so we use the adjusted R-squared. The p-value tests whether the model should be used at all in favour of the mean, so it does not tell us how well the model fits. The residual standard error describes the typical size of the residuals, so it is a measure of model fit.
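As a quick check on why the adjusted value is lower, it can be recomputed from the multiple R-squared in the output above. This sketch uses the standard adjustment formula, with n = 100 sampled observations and p = 7 predictors (consistent with the 92 residual degrees of freedom).

```r
r2 <- 0.6187  # multiple R-squared from the summary output
n  <- 100     # observations in the sample
p  <- 7       # predictors in the model (excluding the intercept)

# Adjusted R-squared penalises each additional predictor
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

The result agrees with the reported adjusted R-squared of 0.5897.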
The problem with using too many predictors
Here, the model is fit with all predictors, then the least significant predictor is removed. This process is repeated until only one predictor remains.
Considering only R^2, which model would we choose?
Model with 1 predictor
Model with 3 predictors
Model with 4 predictors
Model with 7 predictors
The 1-predictor model sacrifices 14.5% of the variation in the response, which is too much. The 7-predictor model is overfitted (it is worse than the 4-predictor model). Between the 3- and 4-predictor models, is a 0.7% improvement worth having to measure height? Realistically, the models with 2 or 3 predictors are justifiable.
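The elimination procedure described in this section can be sketched as a loop that refits the model, records both R-squared values, and drops the predictor with the largest p-value. This is an illustrative sketch only: it assumes the built-in `mtcars` data as a stand-in for the abalone sample, and the `results` table is a hypothetical name.

```r
# Stand-in data (an assumption): mpg plays the role of rings
dat <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat", "disp")]
predictors <- setdiff(names(dat), "mpg")
results <- data.frame()

while (length(predictors) >= 1) {
  # Refit with the current set of predictors and a log-transformed response
  fit <- lm(reformulate(predictors, response = quote(log(mpg))), data = dat)
  s <- summary(fit)

  results <- rbind(results, data.frame(
    n_predictors  = length(predictors),
    r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared
  ))

  # p-values for the slope terms only (drop the intercept row)
  p_values <- coef(s)[-1, "Pr(>|t|)", drop = FALSE]
  least_significant <- rownames(p_values)[which.max(p_values)]
  predictors <- setdiff(predictors, least_significant)
}

print(results)
```

Multiple R-squared can only fall or stay the same as predictors are removed, so on its own it always favours the largest model. The adjusted R-squared column is the one that can peak at an intermediate model size.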
Interpretation
    Call:
    lm(formula = log(rings) ~ diameter + shucked + shell, data = abalone)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.30290 -0.15469 -0.03485  0.11454  0.64573

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   1.4122     0.1594   8.859 4.19e-14 ***
    diameter      2.0346     0.6034   3.372  0.00108 **
    shucked      -1.3339     0.2152  -6.200 1.42e-08 ***
    shell         2.0486     0.3672   5.579 2.23e-07 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Which is the correct equation?
log(rings) = 1.41 + 2.04 x diameter - 1.33 x shucked + 2.04 x shell
log(rings) = 1.41 + 2.03 x diameter - 1.33 x shucked + 2.04 x shell
log(rings) = 1.41 + 2.04 x diameter + 1.33 x shucked + 2.05 x shell
log(rings) = 1.41 + 2.03 x diameter - 1.33 x shucked + 2.05 x shell
Attention to detail :)
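The rounding can be checked by typing in the estimates from the summary output, and the resulting equation can then be used for a prediction. In this sketch the measurements in `new_abalone` are hypothetical values chosen for illustration, not observations from the dataset.

```r
# Estimates copied from the summary output above
est <- c(intercept = 1.4122, diameter = 2.0346, shucked = -1.3339, shell = 2.0486)
rounded <- round(est, 2)
print(rounded)

# Hypothetical abalone (illustrative values only, not from the dataset)
new_abalone <- c(diameter = 0.4, shucked = 0.3, shell = 0.2)

# Evaluate the fitted equation on the log scale
log_rings <- rounded[["intercept"]] + sum(rounded[names(new_abalone)] * new_abalone)

# Back-transform from log(rings) to rings
predicted_rings <- exp(log_rings)
print(predicted_rings)
```

Because the response was log-transformed, the equation returns log(rings), so `exp()` is needed to get back to a ring count (roughly 9.3 for these made-up inputs).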
Interpretation
The equation of our model is:
log(rings) = 1.41 + 2.03 x diameter - 1.33 x shucked + 2.05 x shell
Below are three statements. Given all other predictors are held constant:
rings changes by a factor of e^{-1.33} for every unit increase in shucked (weight)
log(rings) changes by 1.33 for every unit increase in shucked (weight)
log(rings) changes by approximately 1.33% for every percent increase in shucked (weight)
How many statements are correct?
none
1 statement
2 statements
all of them
The first two are correct; the third is not. The natural log percent change approximation only applies to small \beta values, with |\beta| below 0.25, and -1.33 is well outside that range.
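To see how far the approximation drifts for a coefficient of this size, compare the exact multiplicative effect exp(β) with the approximate percent change 100 × β. This is a minimal sketch using the shucked coefficient from the model above. The small coefficient 0.05 is a hypothetical value added for contrast.

```r
beta <- -1.33  # shucked coefficient from the fitted model

multiplier <- exp(beta)              # factor applied to rings per unit increase
exact_pct  <- (exp(beta) - 1) * 100  # exact percent change in rings
approx_pct <- beta * 100             # "small beta" approximation

print(c(multiplier = multiplier, exact_pct = exact_pct, approx_pct = approx_pct))

# Hypothetical small coefficient, where the approximation holds up
beta_small       <- 0.05
exact_pct_small  <- (exp(beta_small) - 1) * 100
approx_pct_small <- beta_small * 100
print(c(exact = exact_pct_small, approx = approx_pct_small))
```

For β = -1.33 the exact change is a drop of about 74%, while the approximation claims an impossible 133% decrease. For β = 0.05 the two figures are within a fraction of a percentage point of each other.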